Before we start analyzing today’s data, let’s first talk about the notion of working directories.
The working directory is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. You can see your current working directory at the top of the console (yours may be different from mine):
You can also see it through the getwd() function.
As you get more experienced and start handling more projects, it’s a good idea to organize your projects into directories and, when working on a project, set the working directory to the project’s directory. That way, you will know where to find your project files and you won’t mix files from different projects together.
You can change the working directory using the setwd() function. An easier way is to go to the menu bar and select Session > Set Working Directory, then choose one of the options there.
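As a minimal sketch of how getwd() and setwd() fit together, the snippet below switches to another folder and back. The temporary folder from tempdir() is only a stand-in for a real project folder, and the variable names are made up for illustration.

```r
# Remember the current working directory so that we can restore it later
old_wd <- getwd()
print(old_wd)

# Switch to a temporary folder (a stand-in for a project folder)
project_dir <- tempdir()
setwd(project_dir)
print(getwd())

# Switch back to where we started
setwd(old_wd)
```

Saving the old directory first makes it easy to undo the change once you are done.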
Why write R scripts? R scripts facilitate easy storage, running and sharing of code. For example:
How to write R scripts? Just type commands in the window for the file, with each command on its own line.
It can be difficult to write an R script all at once. Instead, we can use the following workflow to make sure that the script works as it should:
Run Selected Line(s) (or using the Cmd-Enter or Ctrl-Enter shortcut). This action copies the code to the console and runs it.
For all the code below, follow the workflow above, i.e. type it into the window for the R script, then run it in the console.
Today we’ll be working with an NBA player dataset that I downloaded from Kaggle. We will be working with a refined version of this dataset. For those who are interested, you can download the raw dataset at the Kaggle link and access the script I used to process the data here.
Download the NBA dataset from the course website. Next, make sure that the working directory is the folder where the NBA dataset is located. To load the dataset into R, click on the “Import Dataset” button in the “Environment” pane, then click “From Text (readr)…”
For “File/Url”, click the “Browse” button on the right and locate the NBA dataset. Within a short period of time, the “Data Preview”, “Import Options” and “Code Preview” sections are populated:
First, look at the “Code Preview” section. This is the code that R is using in order to produce the dataset seen in the “Data Preview”. It loads the readr package, then uses the read_csv() function to read in the data file.
Next, look at the “Data Preview” section. Notice how each column has a type associated with it. How does read_csv() know what type each column is? From the documentation, read_csv() looks at the first 1000 rows in the dataset and makes a guess. It’s often correct, but sometimes it’s not.
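When the guess is wrong, one option is to state the column types yourself through the col_types argument of read_csv(). The sketch below uses a tiny made-up CSV written to a temporary file, not the NBA dataset, and the column choices are assumptions for illustration only.

```r
library(readr)

# Write a tiny CSV to a temporary file (made-up data, for illustration only)
path <- tempfile(fileext = ".csv")
writeLines(c("player,G,height", "Player A,82,198", "Player B,41,211"), path)

# Declare the column types instead of letting read_csv() guess them
small_df <- read_csv(path, col_types = cols(
  player = col_character(),
  G      = col_integer(),
  height = col_double()
))

str(small_df)
```

Here G is read as an integer column because we asked for it; left to guess, read_csv() would have read it as a double.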
Now, if we click the “Import” button, R will execute the code in the “Code Preview” section in the console. This is usually not what we want to do since we want to keep any code we execute in a script. Do this instead:
View) and copy it (either by Ctrl-C or Cmd-C, or Right click > Copy).
read_csv(...) is assigned to df. Also amend library(readr) to library(tidyverse): loading the tidyverse package loads readr as well as other packages that we will use today.
You should end up with the code below (with comments added). Run it to import the dataset!
# load NBA dataset
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df <- read_csv("nba_tidy.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## player = col_character(),
## team = col_character(),
## college = col_character(),
## birth_state = col_character()
## )
## See spec(...) for full column specifications.
Consider what the code above does when someone else opens it on their computer. R looks for the file nba_tidy.csv in the present working directory. Hence, anyone using this script must make sure this file is in the present working directory; if not, an error will occur when the script is run.
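One way to make this failure easier to diagnose is to check for the file before trying to read it. The helper below is hypothetical (it is not part of the course script) and only wraps base R’s file.exists().

```r
# Hypothetical helper: returns TRUE if the file can be found, and
# otherwise prints a hint showing where R was looking
data_file_found <- function(path) {
  found <- file.exists(path)
  if (!found) {
    message("Cannot find '", path, "' in ", getwd())
  }
  found
}

data_file_found("nba_tidy.csv")
```

If this returns FALSE, either move the file or change the working directory before calling read_csv().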
Use the functions that you have learnt so far to examine the dataset. What does each row correspond to? How many rows and columns are there?
Each row in this dataset corresponds to a player that played at some point during the NBA 2016-2017 season. Here is a short summary of the variables in the dataset: many of them are standard statistics that are recorded for basketball games.
player: Name of player.
team: Team that player was on for the season. If the player was on more than one team, this refers to the team that the player played the most games with.
G: No. of games played.
GS: No. of games played as a starter.
MP: Total minutes played.
FG: No. of successful field goals (i.e. 2-point or 3-point shots).
FGA: No. of field goals attempted.
3P: No. of successful 3-point shots.
3PA: No. of 3-point shots attempted.
FT: No. of successful free throws.
FTA: No. of free throws attempted.
ORB: No. of offensive rebounds.
DRB: No. of defensive rebounds.
AST: No. of assists.
STL: No. of steals.
BLK: No. of blocks.
TOV: No. of turnovers.
PF: No. of personal fouls.
PTS: Total points scored.
height: Height of player in centimeters.
weight: Weight of player in kilograms.
college: Where the player went to college.
birth_year: Year the player was born.
birth_state: State the player was born.
Let’s add two more columns to this dataset: field goal percentage (i.e. percentage of field goals which were successful), and the age of the player in 2019.
df$FGpct <- df$FG / df$FGA * 100
df$age <- 2019 - df$birth_year
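To see what these two lines do, here is the same arithmetic on a tiny made-up data frame (the numbers are invented, not taken from the NBA dataset). The operations are applied row by row, and a player with no attempts ends up with NaN because 0 / 0 is undefined. This may be the kind of value behind the missing-value warning that shows up in the scatterplot solution later on.

```r
# Made-up numbers for three players; the third never attempted a shot
toy <- data.frame(
  FG         = c(5, 3, 0),
  FGA        = c(10, 4, 0),
  birth_year = c(1990, 1995, 1988)
)

toy$FGpct <- toy$FG / toy$FGA * 100   # element-wise: 50, 75, NaN
toy$age   <- 2019 - toy$birth_year    # element-wise: 29, 24, 31

toy
```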
At this point, our data frame df contains slightly different data. This may be something that we want to save to our local drive, so that in the future we can use this file directly, instead of loading the original one and making the changes.
We have 2 options for doing that. The first is to save it as a .csv file with readr’s write_csv function. Type the following in the console:
write_csv(df, "nba_tidy2.csv")
This saves the value of df to the file nba_tidy2.csv.
The second option is to save it into an .rds file. Type the following in the console:
saveRDS(df, "nba_tidy2.rds")
(The saved file should appear in your working directory.) To read from an .rds file, use the readRDS function. The code below loads whatever is in nba_tidy2.rds and assigns it to the variable df2.
df2 <- readRDS("nba_tidy2.rds")
While write_csv only works for data frames, saveRDS works for any R object.
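As a small illustration of that last point, the sketch below saves a list (which is not a data frame) and reads it back. The list contents and the temporary file path are made up for this example.

```r
# A list is not a data frame, so write_csv() cannot save it
settings <- list(
  season    = "2016-2017",
  teams     = c("GSW", "LAC"),
  min_games = 10
)

# Save to a temporary .rds file, then read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(settings, rds_path)
settings2 <- readRDS(rds_path)

# The object that comes back is the same as the one we saved
identical(settings, settings2)
```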
fct_recode()
Let’s look at just the 4 teams in California for now:
# look at the teams in California
ca_df <- df %>%
filter(team %in% c("GSW", "LAC", "LAL", "SAC"))
Which team made the most field goals? We can answer this question with some dplyr work:
# total FG by team
ca_df %>%
group_by(team) %>%
summarize(tot_FG = sum(FG))
## # A tibble: 4 x 2
## team tot_FG
## <chr> <dbl>
## 1 GSW 3489
## 2 LAC 3242
## 3 LAL 3231
## 4 SAC 3066
For someone who doesn’t know basketball well, the team names as acronyms may not make sense. We can use fct_recode() to replace the acronyms with the full name so that the output is more interpretable:
# relabel the team names
ca_df <- ca_df %>% mutate(team = fct_recode(team,
"Golden State Warriors" = "GSW",
"Los Angeles Clippers" = "LAC",
"Los Angeles Lakers" = "LAL",
"Sacramento Kings" = "SAC"))
# total FG by team
ca_df %>%
group_by(team) %>%
summarize(tot_FG = sum(FG))
## # A tibble: 4 x 2
## team tot_FG
## <fct> <dbl>
## 1 Golden State Warriors 3489
## 2 Los Angeles Clippers 3242
## 3 Los Angeles Lakers 3231
## 4 Sacramento Kings 3066
In fct_recode(), the new level names are on the left while the old level names are on the right. It’s possible to have the same level name appear more than once on the left: this causes different levels to be grouped together. Also, any old level names that don’t appear on the right remain untouched.
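The two rules in the last paragraph can be seen on a small made-up factor. In the sketch below, the label "Los Angeles" is an invented grouping used only for illustration.

```r
library(forcats)

team <- factor(c("GSW", "LAC", "LAL", "SAC", "LAL"))

# "Los Angeles" appears twice on the left, so LAC and LAL are merged;
# GSW and SAC are not mentioned on the right, so they stay as they are
team2 <- fct_recode(team,
  "Los Angeles" = "LAC",
  "Los Angeles" = "LAL"
)

levels(team2)
table(team2)
```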
fct_collapse()
There are a total of 30 NBA teams, and they are grouped into 6 divisions (roughly by geography), each with 5 teams. To add this information to the data frame, we can use fct_collapse():
# create division column
df <- df %>% mutate(division = fct_collapse(team,
Atlantic = c("BOS", "BRK", "NYK", "PHI", "TOR"),
Central = c("CHI", "CLE", "DET", "IND", "MIL"),
Southeast = c("ATL", "CHO", "MIA", "ORL", "WAS"),
Northwest = c("DEN", "MIN", "OKC", "POR", "UTA"),
Pacific = c("GSW", "LAC", "LAL", "PHO", "SAC"),
Southwest = c("DAL", "HOU", "MEM", "NOP", "SAS")))
This allows us to answer questions at the division level, e.g. how many points did players in each division score?
# most points by division
df %>% group_by(division) %>%
summarize(tot_pts = sum(PTS)) %>%
arrange(desc(tot_pts))
## # A tibble: 6 x 2
## division tot_pts
## <fct> <dbl>
## 1 Pacific 44298
## 2 Atlantic 43599
## 3 Southwest 43470
## 4 Northwest 43449
## 5 Central 42857
## 6 Southeast 42080
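To see the mechanics of fct_collapse() in isolation, here is a sketch on a short made-up vector of team codes, using only two of the divisions. Each argument names a new level and lists the old levels that should be folded into it.

```r
library(forcats)

team <- factor(c("BOS", "GSW", "LAL", "NYK", "BOS"))

# Each new level (left) absorbs the old levels listed on the right
division <- fct_collapse(team,
  Atlantic = c("BOS", "NYK"),
  Pacific  = c("GSW", "LAL")
)

table(division)
```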
fct_lump()
Which college produced the most NBA players? Again, this can be answered using dplyr functions (we should filter out the players who have NA for college):
# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>%
group_by(college) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 108 x 2
## college count
## <chr> <int>
## 1 University of Kentucky 24
## 2 Duke University 18
## 3 University of Kansas 14
## 4 Syracuse University 12
## 5 University of California, Los Angeles 12
## 6 Louisiana State University 9
## 7 University of Arizona 9
## 8 University of Florida 9
## 9 Michigan State University 8
## 10 University of North Carolina 8
## # … with 98 more rows
From the summary, we can see that there are a total of 108 colleges represented. If we wanted to see just the top 10, we could add head() to the pipe, but then we don’t know how many other players there were for other colleges.
# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>%
group_by(college) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(n = 10)
## # A tibble: 10 x 2
## college count
## <chr> <int>
## 1 University of Kentucky 24
## 2 Duke University 18
## 3 University of Kansas 14
## 4 Syracuse University 12
## 5 University of California, Los Angeles 12
## 6 Louisiana State University 9
## 7 University of Arizona 9
## 8 University of Florida 9
## 9 Michigan State University 8
## 10 University of North Carolina 8
Instead, we could change the college variable using fct_lump(), which lumps the least common factor levels together into an “Other” category. By specifying n = 10, we tell fct_lump() to keep the most common n = 10 values.
# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>%
mutate(college = fct_lump(college, n = 10)) %>%
group_by(college) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 11 x 2
## college count
## <fct> <int>
## 1 Other 225
## 2 University of Kentucky 24
## 3 Duke University 18
## 4 University of Kansas 14
## 5 Syracuse University 12
## 6 University of California, Los Angeles 12
## 7 Louisiana State University 9
## 8 University of Arizona 9
## 9 University of Florida 9
## 10 Michigan State University 8
## 11 University of North Carolina 8
This summary tells us that most players (225 of the 348 with a college listed) don’t come from the top 10 colleges represented.
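The same lumping can be tried on a toy factor to see exactly which levels survive. The college names "A" to "D" below are placeholders, and n = 2 is chosen only to keep the example small.

```r
library(forcats)

college <- factor(c("A", "A", "A", "B", "B", "C", "D"))

# Keep the 2 most common levels; everything else becomes "Other"
lumped <- fct_lump(college, n = 2)

table(lumped)
```

Here A and B are kept, while C and D (one player each) are combined into an “Other” level with a count of 2.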
fct_infreq() and fct_rev()
How many players were there in each division? We can answer this question with a bar plot:
ggplot(df) +
geom_bar(aes(x = division))
The bars don’t seem to be arranged in an intuitive order. fct_infreq() allows us to arrange them by frequency:
ggplot(df) +
geom_bar(aes(x = fct_infreq(division)))
This orders the bars from tallest to shortest. If we want to order them from shortest to tallest, we can invert the factor ordering using fct_rev():
ggplot(df) +
geom_bar(aes(x = fct_rev(fct_infreq(division))))
The code can be written more elegantly using pipe notation:
ggplot(df) +
geom_bar(aes(x = division %>% fct_infreq() %>% fct_rev()))
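The bar order in these plots follows the order of the factor levels, so it can help to inspect the levels directly. The sketch below does this on a short made-up division vector rather than on df.

```r
library(forcats)

# Central appears 3 times, Pacific twice, Atlantic once
division <- factor(c("Central", "Pacific", "Central",
                     "Atlantic", "Pacific", "Central"))

levels(division)                       # default: alphabetical
levels(fct_infreq(division))           # most frequent level first
levels(fct_rev(fct_infreq(division)))  # least frequent level first
```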
fct_reorder()
Which team attempted the most free throws? What was the distribution of free throw attempts like? We can make a plot of number of free throws attempted by team. Notice that I have swapped the x and y axes here to make the plot easier to read.
# most freethrows (unordered)
df %>% group_by(team) %>%
summarize(total_FTA = sum(FTA)) %>%
ggplot() +
geom_point(aes(x = total_FTA, y = team))
The teams are ordered alphabetically, with ATL at the bottom and WAS on top. This makes it easy to locate a specific team of interest, but it makes it difficult to tell where each team is in relation to the others. We can use fct_reorder() to order the teams by their total free throw attempts:
# most freethrows (ordered)
df %>% group_by(team) %>%
summarize(total_FTA = sum(FTA)) %>%
ggplot() +
geom_point(aes(y = fct_reorder(team, total_FTA), x = total_FTA))
From this, it is clear that PHO had the most free throw attempts while DAL had the fewest. We can also see a clear break between DAL and DET and the rest of the teams.
Which is the oldest team? The data visualization below gives a boxplot of age for each team (we remove NAs first and flip axes for readability):
# age of players by team (unordered)
df %>%
filter(!is.na(age)) %>%
ggplot() +
geom_boxplot(aes(x = team, y = age)) +
coord_flip()
Again, the teams are ordered alphabetically. If we replace x = team with x = fct_reorder(team, age), then the team variable will be ordered by the age values. By default, it will order them by the median of the age values: we can see this by comparing the lines in the middle of the boxplots.
# age of players by team (ordered)
df %>%
filter(!is.na(age)) %>%
ggplot() +
geom_boxplot(aes(x = fct_reorder(team, age), y = age)) +
coord_flip()
Below, we order the teams by the maximum age on each team instead.
# age of players by team (ordered by maximum age)
df %>%
filter(!is.na(age)) %>%
ggplot() +
geom_boxplot(aes(x = fct_reorder(team, age, max), y = age)) +
coord_flip()
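The difference between the default (median) ordering and ordering by maximum can be checked on a toy example. The teams "A" and "B" and the ages below are made up so that the two summaries disagree.

```r
library(forcats)

team <- factor(c("A", "A", "A", "B", "B", "B"))
age  <- c(20, 21, 40, 30, 31, 32)

# Default summary is the median: A (21) comes before B (31)
levels(fct_reorder(team, age))

# With max as the summary: B (32) comes before A (40)
levels(fct_reorder(team, age, max))
```

Team A has one much older player, so it sorts first by median but last by maximum.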
We can use fct_reorder() for bar plots as well if we are using geom_col(), not geom_bar(). For example, say we want a visualization of the top 10 point scorers for the season, with the bar color depicting the player’s field goal percentage. This is an initial attempt without ordering the players:
# top 10 scorers (unordered)
df %>% top_n(n = 10, wt = PTS) %>%
ggplot() +
geom_col(aes(x = player, y = PTS, fill = FGpct)) +
coord_flip()
We may try to arrange the dataset in the order we want before passing it to ggplot(), but that doesn’t work: the players will still be ordered in their default order, i.e. alphabetically.
# top 10 scorers (ordered: doesn't work!)
df %>% top_n(n = 10, wt = PTS) %>%
arrange(desc(PTS)) %>%
ggplot() +
geom_col(aes(x = player, y = PTS, fill = FGpct)) +
coord_flip()
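A rough way to see why sorting the rows has no effect: a discrete axis is laid out by the levels of the variable, and for a character column those levels default to alphabetical order, whatever order the rows are in. The sketch below uses base R’s factor() on made-up player names to mimic this; it is an approximation of what happens inside the plot, not the exact internals.

```r
# Rows are already sorted by points, highest first (made-up data)
scorers <- data.frame(
  player = c("Player C", "Player A", "Player B"),
  PTS    = c(30, 20, 10)
)

# The row order is C, A, B ...
scorers$player

# ... but the levels are alphabetical: A, B, C
levels(factor(scorers$player))
```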
We can use fct_reorder() to order the players:
# top 10 scorers (ordered)
df %>% top_n(n = 10, wt = PTS) %>%
arrange(desc(PTS)) %>%
ggplot() +
geom_col(aes(x = fct_reorder(player, PTS), y = PTS, fill = FGpct)) +
coord_flip()
Once the basic plot is done, we can add some bells and whistles to make the plot more informative and appealing:
# top 10 scorers (ordered: nicer)
df %>% top_n(n = 10, wt = PTS) %>%
arrange(desc(PTS)) %>%
ggplot() +
geom_col(aes(x = fct_reorder(player, PTS), y = PTS, fill = FGpct)) +
scale_fill_gradient(low = "orange", high = "blue") +
coord_flip() +
labs(title = "Top 10 players with most points",
x = NULL, y = "Points")
ggplot2 and dplyr practice
Below are some exercises for you to practice your ggplot2 and dplyr skills. All of them use the df dataset as the starting point. (It doesn’t matter if the division column has been created or not; it will not affect the results below.) Solutions are in the section below.
Plot a histogram of minutes played (MP).
Make a scatterplot of FGpct vs. FGA. Set alpha to 0.5.
Make the same plot as above, but only include players with at least 200 field goal attempts (i.e. FGA >= 200). Add a geom_smooth() layer to the plot.
Who are the top ten players by FGA? Give just the players’ names, FGA and FGpct values.
ggplot2 and dplyr practice (solutions)
Plot a histogram of minutes played (MP).
ggplot(df) +
geom_histogram(aes(x = MP))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Make a scatterplot of FGpct vs. FGA. Set alpha to 0.5.
ggplot(df) +
geom_point(aes(x = FGA, y = FGpct), alpha = 0.5)
## Warning: Removed 1 rows containing missing values (geom_point).
Make the same plot as above, but only include players with at least 200 field goal attempts (i.e. FGA >= 200). Add a geom_smooth() layer to the plot.
df %>% filter(FGA >= 200) %>%
ggplot(aes(x = FGA, y = FGpct)) +
geom_point(alpha = 0.5) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Who are the top ten players by FGA? Give just the players’ names, FGA and FGpct values.
df %>% arrange(desc(FGA)) %>%
head(n = 10) %>%
select(player, FGA, FGpct)
## # A tibble: 10 x 3
## player FGA FGpct
## <chr> <dbl> <dbl>
## 1 Russell Westbrook 1941 42.5
## 2 Andrew Wiggins 1570 45.2
## 3 DeMar DeRozan 1545 46.7
## 4 James Harden 1533 44.0
## 5 Anthony Davis 1527 50.4
## 6 Damian Lillard 1488 44.4
## 7 Karl-Anthony Towns 1479 54.2
## 8 Isaiah Thomas 1473 46.3
## 9 Kemba Walker 1449 44.4
## 10 Stephen Curry 1443 46.8
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [5] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.1
## [9] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.1
## [5] tools_3.6.1 zeallot_0.1.0 digest_0.6.20 lubridate_1.7.4
## [9] jsonlite_1.6 evaluate_0.14 nlme_3.1-140 gtable_0.3.0
## [13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
## [17] rstudioapi_0.10 yaml_2.2.0 haven_2.1.1 xfun_0.9
## [21] withr_2.1.2 xml2_1.2.2 httr_1.4.1 knitr_1.24
## [25] vctrs_0.2.0 generics_0.0.2 hms_0.5.1 grid_3.6.1
## [29] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0 fansi_0.4.0
## [33] readxl_1.3.1 rmarkdown_1.15 modelr_0.1.5 magrittr_1.5
## [37] ellipsis_0.3.0 backports_1.1.4 scales_1.0.0 htmltools_0.3.6
## [41] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [45] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [49] broom_0.5.2 crayon_1.3.4